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Abstract 



pq Background Dramatic increases in RNA structural data have made it possible 

q to recognize its conformational preferences much better than a decade ago. This has 

created an opportunity to use discrete restraint-based conformational sampling for mod- 
qh elling RNA and automating its crystallographic refinement. 

Results All-atom sampling of entire RNA chains, termini and loops is achieved 
using the Richardson RNA backbone rotamer library and an unbiased distribution for 
glycosidic dihedral angle. Sampling behaviour of Rappertk on a diverse dataset of 
Q\ RNA chains under varying spatial restraints is benchmarked. The iterative composite 

crystallographic refinement protocol developed here is demonstrated to outperform CNS- 
2 only refinement on parts of tRNA Asp structure. 

Conclusion This work opens exciting possibilities for further work in RNA modelling 
and crystallography. 



*This document is very similar to a chapter in SG's thesis submitted in Sept. 2007 to the University of 
Cambridge, England. 
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1 Introduction 



1.1 Role of RNA 



RNA is involved in many important biochemical functions involving genetic information, such 
as its storage (viral RNA), communication (mRNA) and modulation (snoRNA, microRNA). 
RNA also performs protein-like functions like enzymatic catalysis (ribosomal peptide bond 
formation - rRNA) and specific binding (amino-acid-specific tRNA) etc. It is believed to have 
played a major role in the early evolution of cellular life because it is functionally intermediate 
to proteins and DNA, exhibiting enzymatic activity as well as information storage and transfer 



(Voet and Voet (1995)). There is an increasing recognition of RNA's importance in cellular 



life (Schlick (2006)) and attempts to organize available experimental information as RNA 



ontology (Leontis et al. ( 2006[ )). 



1.2 RNA structure 

RNA is simpler than proteins in the sequence space due to a much smaller alphabet, but 
structurally it is more complicated. A typical nucleotide contains at least thrice as many 
non-hydrogen atoms as an amino acid residue. The most prominent parts of polynucleotide 
structures are nucleotide bases which are purines or pyrimidines. Purines adenosine and 
guanine are 5,6 aromatic rings and resemble tryptophan's sidechain. Pyrimidines uracyl and 
cytosine are aromatic 6-rings which resemble phenylalanine and tyrosine sidechains. The 
bases can undergo a variety of post-transcriptional modifications, increasing the effective 



number of base types (Dunin-Horkawicz et al. (2006)). A striking feature of DNA and (most) 
RNA structures is the common Watson-Crick pairing of purines with pyrimidines, and the 
associated base stacking. But in RNA structures, there are other non-canonical base interac- 
tions which contribute to stabilization of various RNA motifs (Leontis and Westhof ( 2003[ )). 
Bases are linked to 5-membered ribose sugar rings through glycosidic linkages. The x tor- 
sion angle, which describes base rotation with respect to sugar, is distributed around —120° 
or diametrically opposite to it, around —60° (Schneider et al. ( 2004[ )). Sugar ring connects 
bases to the backbone, and occurs only in two conformations, C3'-endo or C2'-endo. The 
phosphate-sugar backbone has six torsion angles (a, j3, 7, 5, e, () and much greater freedom 
than the protein mainchain. But conformational correlations in that space have been recog- 



nized recently (Duarte and Pyle (1998), Murray et al. (2003), Schneider et al. (2004)). 



Despite chemical differences, protein and RNA chains are logically similar. RNA backbone 
and protein mainchain are the unbranched chains in both polymers and show clear preferences 
for parts of their dihedral spaces. In proteins, mainchain completely determines Cp coordinate 
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and similarly, RNA backbone almost completely determines the sugar coordinates. Bases 
are similar to sidechains, because both are rotameric and confer chemical characteristics to 
respective polymers^ Thus, RNA backbone, sugar and bases are analogous respectively to 
protein mainchain, Cp atom and sidechains. 



1.3 RNA structure prediction 



Like other biopolymers, sequence data for RNA is far greater than 3D structural data. RNA 
crystals generally do not diffract as well as proteins because RNA is harder to purify and 
crystallize, possibly due to size and flexibility. Hence structure prediction methods are im- 
portant to bridge the sequence-structure gap. RNA structure prediction is done at two levels 
- secondary and 3D. Secondary structure prediction is important because it can help identify 
a variety of motifs like stem, hairpin loop, internal loop, junction loop, bulges and pseudo- 
knots. These predictions can prove to be important restraints to guide further 3D structure 
prediction. 3D structure prediction is important to locate interesting sites and tertiary inter- 



actions, but it has so far been dependent on secondary structure prediction (Shapiro et al. 



p007| )). 

Secondary structure prediction estimates the base pairings given a sequence. Due to stan- 
dard Watson-Crick base-pairing, RNA commonly exhibits helical stem regions. The sequence 
that connects the two strands in a stem is called a loop. Stem and loop arrangement can 
develop in a hierarchical fashion, giving rise to a structure that can be represented like a tree. 
Dynamic programming based algorithms like Mfold (Zuker et al. ( 1999[ )), Sfold (Mathews 



et al. ( 1999[ )), RNAstructure (Mathews et al. ( |2004[ )) assign secondary structure in such a 



way as to minimize the free energy for the sequencd^ Optimal and highly-ranked suboptimal 
solutions are very likely to contain the correct secondary structure. Suboptimal solutions can 



be filtered using Boltzmann sampling (Ding et al. (2004)) or abstract shape analysis (Steffen 



et al. ( 2006| ) to enrich the solutions of dynamic programming algorithms. In addition to 



dynamic programming, various other approaches have also been utilized such as genetic al- 
gorithms (Shapiro et al. (2001)) and Monte-Carlo sampling (Xayaphoummine et al. ( 2005[ )). 
All approaches can be further enhanced by using multiple sequence alignments, based on the 
information-theoretic principle that MSAs improve the signal to noise ratio. 

Tree-like simplicity of RNA secondary structure is lost when pseudoloops are formed by 
base-pairing of a stretch in loop with another strand. Pseudoloops are known to occur 
in many more complicated ways than the simplest H-type. They reduce flexibility of the 



x But base rotamericity is weaker and not used in this work. 

2 Free energies used here are experimentally determined as a function of host secondary structure type and 
base-pairing. 
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structure because often the stems involved in a pseudoloop are coaxially stacked. Dynamic 
programming algorithms which include general psudoknots scale poorly but simple H-type 



pseudoknots can be incorporated without loss of efficiency (Shapiro et al. (2007)). 

Fully-automated 3D structure prediction procedures are yet to be devised for RNA. This 
is perhaps due to the complexity of RNA structure and relatively less structural informa- 
tion as RNA is studied more often from a non-structural perspective. Present approaches 



encoded in programs like ERNA-3D (Zwieb and Muller (1997)), RNA2D3D (Yingling and 



Shapiro (2006)) and S2S (Jossinet and Westhof (2005)) are focussed on assisting the 3D 
model building exercise interactively. The inputs are a combination of known/predicted 
secondary structure, features derived from 3D structural data and available experimental 
restraints. The interactively assembled model is generally subjected to molecular dynamics 



refinement and minimization (Shapiro et al. (2007)). 

Recurrent 3D motifs in RNA structure are short sequence-dependent combinations of 
backbone conformations and base interactions. A complex set of noncovalent interactions 
stabilize them. Motif identification has not matured enough to be usable in 3D structure 



prediction (Leontis and Westhof (2003)) 



1.4 RNA crystallographic refinement 

RNA crystallography is harder than protein crystallography because nucleotides are bigger 
and more flexible than amino acid residues. RNA crystals rarely diffract better than 2A. Due 
to many high-quality protein structures, their statistical preferences can be used effectively 
to solve more protein structures. This critical mass effect is yet to be achieved for RNA as 
there are not enough structures for confident identification of backbone preferences and 3D 
motifs. Apart from their stand-alone utility, high-quality single-chain RNA structures are also 
essential for docking into low-resolution EM data of large complexes containing RNA chains. 

Temperature factors suggest that RNA flexibility is the least for paired bases and the 
highest for phosphates. Yet phosphates are also easy to detect due to greater electron density. 
Hence RNA crystallographer identifies bases and phosphates of RNA chain in the initial map 
and then iteratively completes and refines the structure. Due to lack of structural preferences, 
this process is manual, tedious and laborious. Methods and progress in RNA crystallography 



have been reviewed by Holbrook and Kim (1999) and Holbrook (2005). 



1.5 This work 

This work is inspired by the success of rapper's protein sampling which proved effec- 
tive in loop sampling, comparative modelling and automation of crystallographic refinement 
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( |de Bakker et al.\ ( |2003[ ), |DePristo et~al\ ( |2005D , |Furnham efal\ Q2006D ). It is the last task 
that would be very useful to the crystallographer if replicated for RNA. In protein crystallog- 
raphy, approximate locations of C a atoms and sidechains identified by the crystallographer 
are sufficient for rapper to reach an almost refined structure. It is expected that a similar 
approach would work for RNA chains too, given the approximate locations of phosphates and 
bases visually identifiable in the electron density. Apart from crystallographic use, a general- 
ized restraint-based all-atom sampler of RNA would be useful for generating decoy structures 
useful for benchmarking of energy functions. It would also allow generation of models with 
a prescribed sequence and secondary structure, and serve as a tool for generating 3D models 
of RNA motifs. 

In this work, we show that rapper's GABB (genetic algorithm using branch-and-bound 
technique) algorithm can be extended to RNA structures to sample it accurately and efficiently 
under a variety of positional restraints on backbone and bases. We also demonstrate the all- 
atom iterative crystallographic refinement of parts of a tRNA Asp structure. 



2 RNA tracing 



These benchmarks assess the utility of RNA sampling for the intended application of crys- 
tallographic refinement, hence the restraints chosen here reflect the kind of information a 
crystallographer can provide. Spherical positional restraints are used for phosphates (P 
atoms) and CA' atoms. Base planes are restrainted using a union-of-spheres restraint (base- 
plane restraint). This restraint is satisfied when the sampled set of coordinates lie within the 
union of given spheres. 



As described in Gore et al. (2007), Rappertk uses the Richardson rotamer library (Murray 



et al. ( 2003[ )) for RNA backbone sampling. Sugar-phosphate backbone consists of six dihedral 
angles : a (03^ - P t - 05' t - C5'J, P (Pi - Oh' % - C5' t - C4Q, 7 (05- - Ch' % - C4! % - 
C3J), 5 ( C5j - C4> - C 3j - 03',), e (C4> - - 03[ - P i+1 ) and C (C73J - 03\ - P l+l - 
05 - +1 ). |Murray et al.\ (|2003|) define the RNA backbone suite as a set of seven dihedral 



angles {5\ e\ (\ a l+1 , f3 t+1 , 7* +1 , 5 l+1 } and identify 42 distinct rotamers. In a recent effort, 
this library has been extended to 46 rotamers, with standard deviations specified for each 
cluster (J. M. Richardson, personal communication). Glycosidic dihedral x is defined over 
04- — Cl'i — iVl'j — C2'j for pyrimidines 04- — CI ■ — N9[ — C4- for purines, x preferences have 
not been rigorously analyzed, hence in this work it is randomly sampled between —180° and 
+ 180° in steps of 10°. 

The basic operation in RNA chain extension is building next or previous backbone suite 
by sampling a backbone suite rotamer and building sugar/base of present nucleotide using a 
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random sample for glycosidic linkage. Various styles of sampling use this building block in 
different ways. 



2.1 Sampling styles 

For iterative crystallographic refinement, basic operations over the RNA chain are rebuilding 
the whole chain, or its terminal (5' or 3') or an intermediate fragment (loop) : 

• Forward sampling (5' — > 3') is performed using the default RNA builder as described in 
Gore et o/.| ( p007| ). This builder depends on atoms (C5j, C% C3^) and yields (03j, P { 



05- +1 , C5- +1 , CA' i+1 ,C3' i+1 ) atoms (see Gore et al. (2007) for figure). It also builds the 
sugar and base of i th nucleotide. 



Bootstrapping required for sampling the whole chain is explained in Gore et al. (2007). 
It involves approximate positioning of (P, OIP, 02P, Ob', C5', C4f, C3') atoms of the 
first nucleotide. 

Backward sampling (3' — > 5') is performed by slightly changing the forward builder. The 
same backbone rotamers are sampled, but the builder depends on atoms (03-, C3' i7 C4Q 
to calculate coordinates for atoms (C5-, 05-, Pj_i, 03-_ 1; C3-_ 1; CA\_-^) (see Fig|T|. Sugar 
and base for i th nucleotide are also built. 

Loop sampling uses forward sampling. Nucleotides between and including start and end 
indices are rebuilt. Base of (start — l) th nucleotide is resampled within 2A positional 
restraints. Approximate loop closure is achieved by partial sampling of (end + l) th 
nucleotide's (P, C5', CA' , C3') atoms under similar restraints. Loop closure restraint is 
back-propagated by enforcing a spherical positional restraint centered at p end+l atom 
with radius 7 * (end + 1 — i) A on P l atom and also forcing it to remain 5A away from 
the p md + l atom. 



2.2 Initial observations 

For benchmarking of RNA tracing capabilities, we have used a set of diverse RNA chains 



compiled by Duarte and Pyle (1998) for their virtual dihedral analysis (summarized in Table 
[I]). In the first exercise, we restrained P,CA' atoms to 2A positional restraints and sampled 
only the backbones of the chains. But a model could be generated for only 12 of the 48 
chains. This suggested that the Richardson rotamer set consisting of only 46 states was too 
coarse-grained for the task being attempted. Indeed, 46 is a small number for capturing 
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Figure 1: Reverse RNA builder 




Table 1: Dataset of RNA chains used for the tracing exercise. 



PDB id 


Size a 


Filtered Size" 


PDB id 


Size a 


Filtered Size" 


li6u 


37 


11 


ljj2 


121 


18 


lkh6 


27 


10 


119a 


125 


15 


112x 


27 


10 


ln78 


75 


24 


2fmt 


77 


17 


361d 


19 


1 


lmzp 


55 


18 


lk8w 


21 


10 


lduh 


44 


19 


lkq2 


6 





lcxO 


71 


18 


lf7y 


56 


9 


lec6 


19 


6 


IcOa 


77 


37 


lm5k 


91 


20 


lkxk 


69 


11 


lb7f 


12 


3 


lhql 


48 


35 


lflt 


37 


4 


livs 


75 


6 


lcvj 


8 


1 


lgid 


158 


29 


lb23 


73 


13 


lqtq 


73 


28 


lddy 


35 


4 


ling 


96 


8 


le7x 


16 


16 


lf27 


18 


11 


let4 


35 


9 


Ijbs 


28 


11 


lglx 


39 


14 


lehz 


76 


26 


lf7u 


75 


23 


lhr2 


156 


19 


lhmh 


34 


11 


lmji 


33 


7 


ljbt 


28 


7 


Iffy 


74 


11 


lqf6 


76 


25 


le7k 


9 


4 


lm8x 


8 


3 


lntb 


21 


4 


lddl 


7 





lser 


64 


9 


lh4s 


67 


12 


429d 


12 


6 



a Size is the number of nucleotides in the chain as in the deposited PDB structure. 
^Filtered size is the size of the largest contiguous segment of rotameric backbone suites in 
the given chain. The Richardson RNA backbone rotamer set consists of 46 7-dihedral tuples, 
along with standard deviations for all dihedrals in each tuple. A backbone suite is rotameric 
if the largest single-dihedral difference between the suite and the closest Richardson backbone 
rotamer is < 30° or < 3a of that dihedral angle. 
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preferences of a flexible backbone consisting of 6 dihedral angles. Hence it was decided to 
supplement sampling by perturbation - after a rotamer is sampled, a random noise within 
la of the respective dihedrals is added to them. Standard deviations were kindly provided 
by J. M. Richardson (personal communication). When the latest exercise was repeated with 
perturbed sampling, at least one model could be generated for 47 of 48 examples. When 
perturbed backbone-only sampling was performed within tighter lA restraints on P, C4' 
atoms, this dropped to 36 of 48 chains. These failures could be traced to the non-rotameric 
backbone suites present in the chains. Then the longest stretch of good backbone suites was 
identified within every chain. A good suite was defined to be the one for which the largest 
single angle difference from the closest rotamer was within 30° or 3a for that angle. As seen 
from Table [TJ such good fragments are fairly small as compared to whole chain. 45 of 46 
such fragments could be sampled successfully under the same restraints. On these fragments, 
all-atom sampling was also possible within lA restraints on P, C4f atoms and 5A baseplane 
restraints. By dropping the C4' from these restraints, an increase in sampling time was 
observed, accompanied by a reduction in number of examples for which 10 models could be 
built (33 of 45). This is due to population dilution, which in this case is the reduction in 
number of members which will satisfy restraints for the base of next nucleotide. As expected, 
using stricter base restraint of 3A made the matters worse due to greater base restraint 
violations and no propagation of base restraints onto the backbone. Base restraint used here 
is hard to satisfy closely because a small error close to sugar amplifies towards the far end 
of the planar base. This problem can be addressed if given base restraint can be propagated 
onto C4' atom, but it is unclear at present how to achieve this. 

2.3 Sampling performance 

Two characteristics are desirable in a sampling process: (a) given tight restraints, sampling 
should be efficient and (b) given loose restraints, sampling should produce native-like con- 
formations owing to the knowledge of native structure incorporated in it. In other words, 
sampling cost should be directly proportional to length of the sampled fragment and inversely 
proportional to the restraint strictness. Sampling accuracy should be directly proportional 
to restraint radius. 

To check conformity with these expected traits, we carried out backbone-only sampling 
of filtered fragments under positional restraint of 1, 2 and 3A on phosphorous atom. Note 
that fragments may be the entire chains or at either terminus of the RNA chain or in between, 
hence this also tests corresponding sampling styles. All-atom sampling exercises were carried 
out under the same restraints on P atom and baseplane restraint of 5A on bases. 10 modelling 
attempts were made in each sampling exercise. A modelling attempt fails if it cannot produce 
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a model in 5 trials. Each trial uses backtracking, i.e. if sampling fails at a nucleotide, it is 
restarted from a position 3 nucleotides before it in the sampling order. In all-atom sampling, 
glycosidic linkage (x dihedral) is sampled uniformly over the entire range at 10° intervals, 
van der Waals radii of base and sugar atoms are reduced by 50%. Sampling performance is 
quantified by measuring the average RMSD of models and average time taken to produce a 
model as functions of fragment size, restraint radius and whether bases are modelled. 

The time plots (Figj2| suggest a linear correlation between fragment size and sampling 
time for both backbone-only and all-atom models, hence lines of best fit have also been 
plotted. Regression coefficients of these lines are informative. In both cases, regression 
coefficients suggest that the time needed for sampling with P restraint radius of 2A is twice 
as much as that with 3A and four times as much with lA as with 2A. Comparison of the 
regression coefficients in backbone and all-atom cases suggest that latter is nearly ten times 
costlier than the former. 

The RMSD plots (Fig|3]) suggest a weak correlation between the RMSD and fragment sizes, 
i.e. RMSD is lower for smaller fragments with the same restraint radius. This prompted the 
fitting of a log curve. RMSD falls with the size of P restraint. For each restraint size, all-atom 
RMSD is more than backbone RMSD. Interestingly, the backbone RMSD in all-atom case is 
better than that in backbone-only case, indicating the influence of base restraint in guiding 
the backbone. 

Thus, the main sampling trends are: (a) smaller restraint radius leads to greater sampling 
time, (b) all-atom sampling is costlier than backbone-only, but leads to backbones with less 
RMSD (c) sampling time is proportional to fragment size and (d) RMSD tends to be smaller 
for smaller fragments. These trends are expected from previous experience with protein 
sampling exercises. But there are significant differences too, due to differences in restraint 
density. In protein C a tracing, backbone sampling models 3iV atoms under N positional 
restraints (ignoring carbonyl oxygen), hence positional restraint density is |. In the RNA 
backbone tracing exercise carried out here, this density is | (ignoring phosphate oxygens 
OIP, 02P). This is reflected in the backbone RMSD: in the case of proteins, backbone RMSD 
is generally lower than the C a restraint radius but it is generally higher for RNA than the 
P restraint radius. Another difference is rotamericity of protein sidechains and lack of it in 
glycosidic linkages. This is indicated by lower all-atom RMSD for proteins than RNA chains 
under similar restraints. To sum up, trends observed in RNA sampling are expected and 
satisfactory enough to attempt application to the crystallographic scenario. 



10 



Figure 2: Variation in sampling time with RNA fragment size. The following scatter plots 
(above backbone-only, below all-atom) indicate a linear correlation between average sampling 
times and fragment sizes. Regression coefficients of lines of best fit suggest that sampling 
time nearly doubles from 3A to 2A and almost quadruples from 2A to lA. Similarly, all-atom 
sampling is roughly a magnitude costlier than backbone-only case. Note that 3 outliers have 
not been considered for the lA backbone-only plot and 4 fragments did not yield any model 
during lA all-atom sampling. 
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Figure 3: Variation in sampling accuracy with RNA fragment size. The following plots (top- 
lA, middle-2A, bottom-3A) show relationship between average RMSD of models of fragments 
and fragment lengths. There is a weak tendency to have lower RMSD for lower lengths, hence 
a log curve was fitted for each scatter. In general, at a given P restraint radius, all-atom 
models have better backbone RMSD than backbone-only models. All-atom RMSD is slightly 
greater than backbone RMSD in all-atom models. RMSD increases as restraint radius increases. 
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3 Foray into crystallographic refinement 



3.1 About tRNA structure 



Transfer RNAs are classic structures from the 1970s. Till mid-90s, structures of tRNA (Hingerty 



et al.\ ( |1978[ ), |Sussman et al.\ ( |1978[ ), |Westhof et al.\ ( |1988[ ), |Westhof and Sundaralingam 



(1986)) were the only large RNA structures in pdb (Shi and Moore (2000)), making them 
remarkable achievements of crystallography techniques of that decade. tRNA is a cloverleaf- 
shaped molecule in its secondary structure representation and has a L-shaped 3D form (FigjiJ). 
tRNA is an essential cog in the translational machinery of the cell which incrementally trans- 
lates the transcripted mRNA into peptide chain one residue at a time. Ribosome finds a 
tRNA with a 3-nucleotide anticodon complementary to current mRNA codon. This tRNA 
has an amino acid attached to its 5' end, which the ribosome then attaches to the growing 
polypeptide. 

tRNA structures are attractive for demonstrating crystallographic utility of discrete restraint- 
based RNA sampling because they are neither too small nor too large, are structurally well- 
studied and have 3 loop regions (anticodon loop, TipC loop, D loop) with non- Watson-Crick 



base pairing. For this work, tRNA Asp structure was used, solved at 3A by Westhof et al. 



( |1988[ ). This structure (pdb 2tra) refines to R/R fr ee of 0.2552/0.3063 with CNS starting 
from deposited structure and data. 



3.2 Composite refinement protocol 

Similar to composite refinement protocols used earlier in the thesis, this work also uses 
perturbed starting structures and rebuilds them with the aim of improving Rf ree - In brief, 
Rappertk identifies the ill-fit nucleotides by calculating the correlation coefficient between F c 
map and a^-weighted CNS omit map for regions around the backbone, sugar/base and entire 
nucleotide. Low (< 0.9) correlation coefficient indicates nucleotide stretches to rebuild, which 
are then built incrementally using GABB algorithm. Ten times more children are generated 
as the population size, and top 10% are retained based on their electron density occupation 
score, leading to an enriched population. Resampled nucleotides get a S-factor of 30 assigned 
to all of their atoms. Non-RNA atoms (ligands and waters) are not used during sampling. 
Best member of population (according to density occupation) is written out as the new model 
along with non-RNA atoms appended to it. The coordinates and 5-factors of non-RNA atoms 
are copied from the previous refinement iteration. This model is refined with CNS (2 rounds of 
MDSA starting at 5000i^, intervened by a 200-step minimization). This procedure is repeated 
for 10 iterations. It is expected that RNA models generated with rotameric backbone states 
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Figure 4: tRNA structure: the schematic diagram shows the typical secondary structure of 
tRNA. 3D representation below it shows all-atom and cartoon representation of tRNA^ as 
in PDB entry 2tra jWesthof et ~ai\ (]1988[)). 



3 'end 
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to obey given positional restraints, positive electron density restraints and excluded volume 
restraints would be within the convergence radius of CNS, i.e. such models can be used to 
assist CNS in finding well-refined structures, starting from ill-fitting ones. 

3.3 Refining a helical fragment 

CNS refinement was performed initially on the anticodon loop (nucleotides 33 — 37) and the 
TipC loop (nucleotides 54 — 60), starting from models where the loops were perturbed by RNA 
tracing within tight positional restraints (P, baseplane restraints of 2A, 5 A respectively). In 
both cases we observed that CNS was able to correct the errors introduced in the native 
structure. This was in contrast to proteins where similar positional restraints on C a and 
sidechains generally result in unsatisfactory CNS-only refinement. But removal of baseplane 
restraints from the RNA trace deteriorated the refinement quality. This suggested that CNS 
convergence radius is larger for RNA structures than proteins, and Rappertk sampling may 
be of value only in cases where spatial information about the structure is highly uncertain. 

In order to use a simple example to begin with, a fragment in RNA duplex (nucleotides 
23 — 27) was chosen, with clear base densities. Initial perturbation was carried out with 2A P 
restraints and no base restraints to generate 5 models. The perturbed models were subjected 
to CNS refinement only. In 4 of 5 cases, CNS refinement was unsatisfactory. 3 such cases 
are shown in Fig|5j When the composite refinement protocol was applied to the same region 
with the same starting models, all trajectories resulted in well-refined structures (Fig|6]). The 
mean of best Rf ree values in CNS-only trajectories was 0.311 as compared to 0.304 for the 
composite protocol trajectories. It is interesting to note that Rf ree does not strongly reflect 
the salient differences in the refinement trajectories indicated by Fig|5]and FigJH] 

3.4 Refining the TipC loop 

The same exercise was repeated for nucleotides 54 — 60, the TipC loop. The native density 
for this loop is not as good as the helical fragment (see Figj7]). CNS-only refinement resulted 
in mean best Rf ree of 0.316 over the 5 refinement attempts, whereas the same for composite 
refinement was 0.303. Visual inspection of these models shows the greater variability in the 
CNS models and that each attempt was stuck in a local minimum. 3 of 5 composite models 
refined to a structure very similar to native, but the rest were trapped in a local minima. 
Close observation of these 2 cases revealed that spurious density appearing elsewhere led 
Rappertk sampling away from the native. 
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Figure 5: CNS-only refinment of tRNA Asp (PDB 2tra) can be unsatisfactory. Starting models 
were generated by perturbing a 5 nucleotide fragment (23-27) with 2A P restraints and no 
base restraints. Top-left panel shows the native structure of the fragment with its omit map 
contoured at 2a. Other 3 panels show the best Rf ree structures in 3 different CNS-only 
refinement trajectories, with respective omit maps also contoured at 2a. This suggests that 
CNS can get trapped in local minima in case of high initial structural uncertainty. Note 
that the CNS refinement here is with minimal restraints, i.e. hydrogen-bonding restraints 
between base-pairs were not provided to CNS. 
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Figure 6: Composite CNS / Rappertk refinement of a helical fragment from tRNA" 4 ^. Spherical 
positional restraints of radius 2A were used around P atoms of the 5-nucleotide (23-27) frag- 
ment from PDB 2tra. No restraints were imposed on bases. The CNS / Rappertk refinement 
resulted in satisfactory refinement in all 5 attempts. Best Rf ree models in each trajectory 
are shown in magenta, with the deposited structure in sticks representation. 
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Figure 7: Composite CNS/ Rappertk refinement of the 7- nucleotide TipC loop (54-60) from 
tRNA Asp with 2A P restraints and no base restraints. The top panel shows the native fragment 
with its omit map density. Middle panel shows the best Rf ree models of CNS-only refinement. 
Bottom panel shows the same for the composite refinement. Native structure (green) is shown 
for reference in middle and bottom panels. 
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3.5 Refining the anticodon loop 

Anticodon loop spans nucleotides 33 — 37, of which G — 34, U — 35 and C — 36 have two 
equally occupied states in the 2tra structure. Initial attempts to repeat the previous exercise 
on this loop were unsatisfactory because Rappertk tried to fit a single conformation to these 
heterogenous nucleotides. Due to this, we created artificial diffraction data at the same 
resolution by considering only the first conformation of each nucleotide and assigning full 
occupancy to it. This significantly changed the refinement trajectories and a similar trend as 
previous two exercises could be observed. For five CNS-only and composite refinements, mean 
best Rf re e values were 0.254 and 0.215 respectively, indicating a much improved refinement 
with the composite protocol. Fig|8] shows that the composite protocol yields almost identical 
structures and CNS-only refinement gets trapped in different local minima. 



3.6 Typical problems with refinement protocols 

There are two main reasons for suboptimal CNS-only refinement, identifiable from successive 
structures in the refinement trajectories (Figj9j. Firstly, if a base is very far away from its 
native-like location, CNS refinement does not restore it. Secondly, a base may get trapped 
into densities of phosphate, sugar or another base, in which case even if the base is not too 
far away, it is difficult to restore it. This is reminiscent of bulky misplaced sidechains in 
protein crystallographic refinement. 

Structure trajectories suggest that improved refinement with CNS / Rappertk protocol must 
be due to relocation of bases by RNA sampling, which is brought about by the electron-density 
based enrichment of incremental building using rotameric backbones. A typical corrective 
rebuilding step is shown in Fig, 10 The obvious mistakes in base placement are corrected 
with Rappertk whereas CNS carries out small corrections to take the conformations towards 
the optimal. 

There are some imperfections in the Rappertk sampling scheme which may sometimes lead 
to incorrect final structures: (a) RNA is very flexible and population size of 300 and enrichment 
factor of 10 may not be sufficient (b) Lack of x preferences means that selective pressure due to 
bases is low - it is further weakened in case of weak base density (c) Collateral damage may be 
caused by CNS refinement of a defective loop, e.g. perturbations in nearby regions of structure 
or symmetry-related copies do not get repaired during Rappertk rebuilding step leading to 
higher Rf ree (d) Scoring scheme based on maximizing the electron density occupation may 
promote occupation of sharp peaks like waters and phosphates although there are obvious 
dissimilarities between such peaks and the shape of a base. 
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Figure 8: Composite CNS/ Rappertk refinement of the 5-nucleotide anticodon loop (33-37) 
with 2A P restraints and no base restraints. The top panel shows the native fragment with 
its omit map density. Middle panel shows the best Rf ree models of CNS-only refinement. 
Bottom panel shows the same for the composite refinement. Native structure (green) is 
shown for reference in middle and bottom panels. 
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Figure 9: CNS refinement of RNA can get trapped in local minima. An instance of CNS-only 
refinement yields a structure (green) very different from native (magenta). Corresponding 
2F Q — F c omit map is shown contoured at lex around a three nucleotide stretch (green sticks, 
nucleotides 54-56) only for clarity. Corresponding nucleotides in the two structures are shown 
with arrows. CNS model has bases too far away from native locations and also occupying 
the wrong density. Also note the appearance of density around wrong bases. 
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Figure 10: A typical corrective step carried out by Rappertk RNA sampling with rotameric 
backbone and density enrichment. Blue is the native structure, green is a perturbed and 
CNS-refined model. Magenta is the Rappertk model found using the positional restraints and 
omit map of the green model. Note that Rappertk model removes gross errors yet small 
errors may remain in comparison to native. 
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4 Conclusion 



This work suggests that knowledge-based sampling can be applied efficiently and productively 
to RNA structures. GABB algorithm was extended to sampling of RNA chains at 5'-end, 3'-end 
and intermediate regions. Modified nucleotides were incorporated in addition to standard 
ones for all-atom sampling. Using a 48-chains dataset, we showed that sampling performance 
is along expected lines and suggests its suitability for real-world applications like crystallog- 
raphy. Then we demonstrated that a helical strand, the TipC loop and the anticodon loop in 
the tRNA Asp structure can be automatically sampled and iteratively refined using crystallo- 
graphic data. It was found that the composite CNS/ Rappertk protocol yields structures better 
refined than those by the CNS-only protocol. Shortcomings of both protocols were discussed. 
This work shows that automated crystallographic refinement of RNA chains is possible given 
the approximate trajectory of phosphates. This is a promising result for reducing manual 
effort and allowing exploration of multiple conformations. 

Yet some concerns remain and must be addressed in future work. Sampling preferences 
themselves are imperfect. A-form conformation is adopted by more than 50% RNA suites 
but population frequencies for rest of the backbone rotamers are unclear. Hence we have 
used equal weights for all suite rotamers. For similar reasons, we have not used the weakly 
bimodal nature of the glycosidic linkage. A careful analysis of available structural data will 
be required before incorporating such preferences reliably, because sampling preferences are 
meant to bias the conformational search and not restrict it. Another improvement necessary 
for quicker sampling is the propagation of phosphate and base restraint onto the backbone 
(e.g. on C4') so that base restraint satisfaction becomes more likely. At present this is a 
sampling bottleneck. 

There are a few promising ways to extend this work. Firstly, whole-chain crystallographic 
refinement of RNA structures can be performed for low resolution structures to reduce the 
number of non-rotameric suites. Secondly, RNA sampling can be used to generate 3D all-atom 
conformations for secondary structures or motifs by expressing the base pairing/stacking 
interactions as distance restraints. All conformations sampled to satisfy these restraints will 
be useful in 3D structure prediction which is, as noted before, a process of assembling 3D 
coordinates of predicted secondary structures. Finally, protein and RNA sampling can be 
combined together for automating the crystallography of protein-RNA complexes, especially 
the very large ones like ribosomes so that human attention will be required only in the early 
and late stages of refinement. 
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